Bank Marketing Data Analysis
1. Introduction
1-1. About Dataset
Our dataset comes from telemarketing campaigns run by a Portuguese banking institution. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit, our target variable 'y') would be subscribed ('yes') or not ('no').
1-2. Problem/Task description
Main Problem Statement: Predict whether a bank client will subscribe to a term deposit (yes/no, binary classification) based on the client's basic information, telemarketing campaign data, and social & economic data.
Purpose of the project (Why do we need this project?):
While building this prediction model, we can also identify which types of clients are more likely to respond positively to the marketing campaign, i.e. the factors that make bank marketing more successful. There are many new ways of marketing nowadays, but the one thing every company expects from a campaign is maximum revenue with minimum cost wasted on approaching non-responding clients. Direct marketing such as telemarketing is comparatively costly and its response rate is quite low, so targeting the right client is one of the most important concerns. With this in mind, the main goal of this project is to build a prediction model that can suggest a proper direction, the right target clients, and a strategy for successful cross-selling of bank products such as the term deposit.
Hypothesis & Research Questions:
- Clients with a loan have less chance of subscribing to a term deposit than those without.
- Clients with certain types of 'job' have more chance of subscribing to a term deposit than others.
1-3. Feature Explanation
Input variables:
Client Personal Info Variables:
1 - age (numeric)
2 - job : type of job (categorical:'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
Variables with other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
Output variable (Target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
Import Libraries
You can unfold the code if you want to check which libraries are used.
suppressPackageStartupMessages({
library(tidyverse)
library(rapportools)
library(data.table) # datafile reading
library(dplyr)
library(dlookr)
library(magrittr)
library(tidyr)
library(lubridate)
library(DT)
library(gmodels)
library(sqldf)
library(caret) # rocr analysis
library(ROCR) # rocr analysis
library(kableExtra) # nice table html formating
library(gridExtra) # arranging ggplot in grid
library(rpart) # decision tree
library(rpart.plot) # decision tree plotting
library(caTools) # split
library(randomForest) #randomforests
library(class) # knn
library(ipred) # bagging
library(adabag) # boosting
library(purrr)
library(fastDummies)
library(DMwR)
#visualization
library(ggplot2)
library(ggmosaic)
library(corrplot)
library(ggthemes)
library(readr)
library(plotly)
library(cowplot)
library(plotROC)
library(ggpubr)
library(C50)
library(e1071)
library(pROC)
library(PRROC)
library(rattle)
library(RColorBrewer)
library(forecast)
library(zoo)
library(DescTools)
library(knitr)
library(VIM)
library(mice)
library(psych)
library(Epi)
})
Load Dataset
bank <- read.csv(file = "/Users/jiyoungkim/Desktop/SGH 2 sem/SLM class/Project/bank-additional/bank-additional-full.csv",
sep =";", stringsAsFactors = T)
str(bank)
'data.frame': 41188 obs. of 21 variables:
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
$ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
$ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
$ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
$ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
$ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
$ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
$ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
$ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
$ duration : int 261 149 226 151 307 198 139 217 380 50 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
$ previous : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
$ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
$ cons.price.idx: num 94 94 94 94 94 ...
$ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
$ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
$ nr.employed : num 5191 5191 5191 5191 5191 ...
$ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
summary(bank)
      age                 job            marital
Min. :17.00 admin. :10422 divorced: 4612
1st Qu.:32.00 blue-collar: 9254 married :24928
Median :38.00 technician : 6743 single :11568
education default housing loan
university.degree :12168 no :32588 no :18622 no :33950
high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
basic.9y : 6045 yes : 3 yes :21576 yes : 6248
contact month day_of_week duration
cellular :26144 may :13769 fri:7827 Min. : 0.0
telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
aug : 6178 thu:8623 Median : 180.0
campaign pdays previous poutcome
Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
Median : 2.000 Median :999.0 Median :0.000 success : 1373
emp.var.rate cons.price.idx cons.conf.idx euribor3m
Min. :-3.40000 Min. :92.20 Min. :-50.8 Min. :0.634
1st Qu.:-1.80000 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344
Median : 1.10000 Median :93.75 Median :-41.8 Median :4.857
nr.employed y
Min. :4964 no :36548
1st Qu.:5099 yes: 4640
Median :5191
[ reached getOption("max.print") -- omitted 4 rows ]
2. Data Preprocessing & Cleaning
2-1. Duplicate Rows
sum(duplicated(bank))
[1] 12
Let's drop all these duplicated rows
bank = bank %>% distinct()
2-2. Missing values
When checking for missing values, NA is not the only kind to look for; we actually need to check several possible cases:
- NA values
- Blank/Space values
- Unknown values
sum(is.na(bank)) # turns out there are no NA values
[1] 0
sum(bank == "" | bank == " ") # turns out there are no blank/space values
[1] 0
Even though we don't have any NA or blank values, the table below shows that some variables still contain a large share of 'unknown' values.
colSums(bank == "unknown")
           age            job        marital      education        default
0 330 80 1730 8596
housing loan contact month day_of_week
990 990 0 0 0
duration campaign pdays previous poutcome
0 0 0 0 0
emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 0 0 0 0
y
0
That is too large a number to simply drop all of the 'unknown' values. They amount to roughly 30% of all observations ((12718/41188)*100 = 30.88%), so deleting them all at once would be a large sacrifice and could even distort the distribution of our target variable 'y'.
Therefore, I will decide how to handle these missing values only after checking the distribution of 'y' within the 'unknown' level of each variable. I will then check the frequency distribution within each variable that contains 'unknown' to pick a sensible imputation value.
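The per-variable check described above can be sketched in base R. The helper `unknown_yes_ratio()` is my own illustrative function (not part of the analysis pipeline), and it is demonstrated on a small toy data frame, since the real `bank` object is read from a machine-specific CSV path; with the real data, pass `bank` instead of `toy`.

```r
# Share of y = "yes" inside the 'unknown' level of each categorical variable,
# to compare against the overall "yes" rate before deciding on imputation.
toy <- data.frame(
  job       = c("admin.", "unknown", "services", "unknown", "admin."),
  education = c("unknown", "basic.4y", "high.school", "unknown", "unknown"),
  y         = c("yes", "no", "no", "yes", "no"),
  stringsAsFactors = FALSE
)

unknown_yes_ratio <- function(df, var, target = "y") {
  sub <- df[df[[var]] == "unknown", target]  # rows where this variable is 'unknown'
  if (length(sub) == 0) return(NA_real_)     # variable has no 'unknown' level
  round(100 * mean(sub == "yes"), 2)         # % of 'yes' within the unknown level
}

vars_with_unknown <- c("job", "education")
sapply(vars_with_unknown, function(v) unknown_yes_ratio(toy, v))
```

If the 'yes' ratio inside 'unknown' closely matches the other levels of a variable, dropping or imputing those rows is unlikely to bias the target distribution.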
2-3. Outliers
Diagnose outliers of the numerical variables (here, with_mean is the arithmetic mean including outliers).
diagnose_outlier(bank) %>%
mutate(outliers = outliers_mean / with_mean) %>%
arrange(desc(outliers))
      variables outliers_cnt outliers_ratio outliers_mean    with_mean
1 previous 5625 13.660870 1.266489 0.17301341
2 campaign 2406 5.843210 11.004156 2.56787935
3 duration 2963 7.195939 968.217010 258.31581504
4 age 468 1.136584 76.940171 40.02380027
5 cons.conf.idx 446 1.083155 -26.900000 -40.50286332
6 pdays 1515 3.679328 6.014521 962.46480960
7 emp.var.rate 0 0.000000 NaN 0.08192151
8 cons.price.idx 0 0.000000 NaN 93.57571989
9 euribor3m 0 0.000000 NaN 3.62129345
10 nr.employed 0 0.000000 NaN 5167.03486983
without_mean outliers
1 0.00000000 7.320177778
2 2.04433841 4.285308922
3 203.27074556 3.748190987
4 39.59939078 1.922360456
5 -40.65181684 0.664150576
6 999.00000000 0.006249082
7 0.08192151 NaN
8 93.57571989 NaN
9 3.62129345 NaN
10 5167.03486983 NaN
Based on 'outliers_ratio' and the outlier score (= outliers_mean / with_mean), 'previous', 'campaign', and 'duration' appear to be the top 3 variables with the most severe outliers. Let's decide whether to delete or transform by looking at the distribution plots of these 3 variables.
bank %>%
plot_outlier(diagnose_outlier(bank) %>%
filter(outliers_ratio >= 5) %>%
select(variables) %>%
unlist())
Summary of Outliers Analysis
1) Variable 'duration': the last contact duration, in seconds (numeric). As the dataset documentation warns, this variable should not be included among the X variables of our prediction model, because the duration cannot be known before a call is performed. Therefore, we will simply delete this variable.
2) Variable 'campaign': the number of contacts performed during this campaign for this client (including the last contact). From a marketing perspective, contacting one client more than 10 times is excessive and rarely useful, so I decided to drop all rows with values > 10 in this variable.
3) Variable 'previous': most values are concentrated at '0', meaning there were no contacts before this campaign for that client. Since only a small proportion of clients were contacted one or more times, I decided to bin this variable into a categorical variable with 2 levels, No (0) / Yes (1), indicating whether there was any previous contact. Accordingly, its data type changes from numeric to a factor with 2 levels.
2-3.1. Transform variables based on Outliers analysis results
# Drop the variable 'duration'
bank_final = bank %>%
select(-duration)
# Drop all the values > 10 within the variable 'campaign'
bank_final = bank_final %>%
filter(campaign <= 10)
# Binning the variable 'previous' (num -> binary categorical variable)
bank_final = bank_final %>%
mutate(previous = if_else(previous >= 1, "1", "0"))
bank_final$previous <- as.factor(bank_final$previous)
3. EDA
Now let's handle the remaining Data Preprocessing & Cleaning steps (such as dealing with missing data) alongside the EDA.
3-1. Univariate, Bivariate Analysis
Target Variable
We can see that our data is highly imbalanced, with Yes:No = 11%:89%, which means the proportion of 'Yes' is too small to use the data for modelling as-is. Therefore, when we prepare the dataset for modelling after the EDA, we will use SMOTE (Synthetic Minority Over-sampling Technique) to deal with this imbalance.
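The actual balancing will later be done with `DMwR::SMOTE()` (the DMwR package is loaded above). As a transparent illustration of what SMOTE does, here is a minimal numeric-only sketch of my own (`smote_sketch` is hypothetical, not the implementation we will use): each synthetic minority point is a random interpolation between a minority observation and one of its k nearest minority neighbours.

```r
# Minimal SMOTE-style oversampling sketch for a numeric minority-class matrix.
smote_sketch <- function(X, n_new, k = 3) {
  d <- as.matrix(dist(X))          # pairwise distances among minority rows
  diag(d) <- Inf                   # a point is not its own neighbour
  out <- matrix(0, n_new, ncol(X))
  for (i in seq_len(n_new)) {
    a  <- sample(nrow(X), 1)                          # random minority point
    nb <- order(d[a, ])[seq_len(min(k, nrow(X) - 1))] # its k nearest neighbours
    b  <- nb[sample(length(nb), 1)]                   # pick one neighbour
    w  <- runif(1)                                    # interpolation weight in [0, 1]
    out[i, ] <- X[a, ] + w * (X[b, ] - X[a, ])        # point on the segment a-b
  }
  out
}

set.seed(42)
minority <- matrix(rnorm(20), ncol = 2)   # 10 toy minority-class observations
synth    <- smote_sketch(minority, n_new = 5)
dim(synth)   # 5 synthetic rows, 2 features
```

In the modelling stage the call will instead look like `SMOTE(y ~ ., data = train, perc.over = 200, perc.under = 200)` from DMwR, where those percentages are illustrative, not tuned values.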
ggplot(data = bank_final, aes(x=y, label = scales::percent(prop.table(stat(count)))))+
geom_bar(stat = "count", aes(fill=y))+
geom_text(stat = 'count', position = position_dodge(.9), vjust = -0.5, size = 3.5) +
stat_count(geom = "text", colour = "black", size = 3.5, aes(label = ..count..), position=position_stack(vjust=0.5))+
labs(title="Distribution of y variable (Deposit)", x="Deposit", y="Count") + scale_fill_brewer(palette="Set3")
Categorical Variables
As an overview, let's start by checking the level with the biggest proportion within each categorical variable.
diagnose_category(bank_final) %>%
filter(rank == 1)
# A tibble: 12 x 6
variables levels N freq ratio rank
<chr> <fct> <int> <int> <dbl> <int>
1 job admin. 40307 10172 25.2 1
2 marital married 40307 24386 60.5 1
3 education university.degree 40307 11904 29.5 1
4 default no 40307 31943 79.2 1
5 housing yes 40307 21149 52.5 1
6 loan no 40307 33217 82.4 1
7 contact cellular 40307 25732 63.8 1
8 month may 40307 13592 33.7 1
9 day_of_week thu 40307 8389 20.8 1
10 previous 0 40307 34691 86.1 1
11 poutcome nonexistent 40307 34691 86.1 1
12 y no 40307 35695 88.6 1
Based on this result, we can glimpse the attributes of our client pool: a large share of clients have admin jobs, are married, hold a university degree, have no credit in default, have a housing loan but no personal loan, were last contacted by cellular phone in May on a Thursday, and had no previous contact before this campaign.
1) Job
job_y <- sqldf(
"select job, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select job, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
job_y
             job total deposit_yes deposit_no yes_ratio no_ratio
1 student 868 275 593 31.68 68.32
2 retired 1684 432 1252 25.65 74.35
3 unemployed 990 144 846 14.55 85.45
4 admin. 10172 1342 8830 13.19 86.81
5 management 2868 328 2540 11.44 88.56
6 unknown 321 36 285 11.21 88.79
7 technician 6611 728 5883 11.01 88.99
8 self-employed 1389 148 1241 10.66 89.34
9 housemaid 1032 106 926 10.27 89.73
10 entrepreneur 1424 124 1300 8.71 91.29
11 services 3886 322 3564 8.29 91.71
12 blue-collar 9062 627 8435 6.92 93.08
In the table above, the 'unknown' level (a kind of missing value) shows nothing unique or insightful: its distribution is similar to the other levels, so dropping these unknown values from 'job' will not hurt our dataset much. Let's drop them.
bank_final = bank_final %>% filter(job != "unknown")
bank_final$job <- factor(bank_final$job)
The plot and table below show the overall distribution of the variable 'job' after dropping those unknown values.
# Set up default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
hjust = 1,
vjust = 0.5),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
job_vis <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, job), fill = y)) + mosaic_theme +
xlab("job") + scale_fill_brewer(palette="Set2")
ggplotly(job_vis)
job_y_new <- sqldf(
"select job, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select job, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
job_y_new
             job total deposit_yes deposit_no yes_ratio no_ratio
1 student 868 275 593 31.68 68.32
2 retired 1684 432 1252 25.65 74.35
3 unemployed 990 144 846 14.55 85.45
4 admin. 10172 1342 8830 13.19 86.81
5 management 2868 328 2540 11.44 88.56
6 technician 6611 728 5883 11.01 88.99
7 self-employed 1389 148 1241 10.66 89.34
8 housemaid 1032 106 926 10.27 89.73
9 entrepreneur 1424 124 1300 8.71 91.29
10 services 3886 322 3564 8.29 91.71
11 blue-collar 9062 627 8435 6.92 93.08
Insights from 'Job' variable
- As we can see above, clients who are students (31.7%) or retired (25.7%) subscribed to the deposit relatively more often than clients with other jobs.
- Since the positive subscription ratio among retired clients is high, it might also be worth creating another cross-selling product (a loan, a fund, etc.) customized to retired clients.
- One of the hypotheses we set at the beginning, "Clients with certain types of 'job' have more chance of subscribing to a term deposit than others", appears to hold: students, retired, and unemployed clients are more likely to subscribe to the deposit.
2) Marital
marital_y <- sqldf(
"select marital, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select marital, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
marital_y
   marital total deposit_yes deposit_no yes_ratio no_ratio
1 single 11254 1596 9658 14.18 85.82
2 unknown 68 9 59 13.24 86.76
3 divorced 4506 471 4035 10.45 89.55
4 married 24158 2500 21658 10.35 89.65
In the 'marital' variable too, the 'unknown' level gives no special insight, and with only 68 observations it is a very small portion, so deleting it will not harm the data distribution.
marital_v <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, marital), fill = y)) + mosaic_theme +
xlab("marital status") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")
ggplotly(marital_v)
marital_y_new <- sqldf(
"select marital, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select marital, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
marital_y_new
   marital total deposit_yes deposit_no yes_ratio no_ratio
1 single 11254 1596 9658 14.18 85.82
2 divorced 4506 471 4035 10.45 89.55
3 married 24158 2500 21658 10.35 89.65
Insights from 'Marital' variable
As you can see, there is no severe difference in deposit subscription ratio between single (14.18%), divorced (10.45%), and married (10.35%) clients; single clients subscribe only slightly more often.
3) Education
education_y <- sqldf(
"select education, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select education, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
education_y
            education total deposit_yes deposit_no yes_ratio no_ratio
1 illiterate 18 4 14 22.22 77.78
2 unknown 1556 231 1325 14.85 85.15
3 university.degree 11832 1645 10187 13.90 86.10
4 professional.course 5118 591 4527 11.55 88.45
5 high.school 9249 1024 8225 11.07 88.93
6 basic.4y 4028 420 3608 10.43 89.57
7 basic.6y 2225 187 2038 8.40 91.60
8 basic.9y 5892 465 5427 7.89 92.11
There are two things we need to preprocess based on the table above.
1) The 'illiterate' level, with only 18 observations, risks misleading our modelling, so I will drop it.
2) Unlike the previous categorical variables, 'education' has quite a large 'unknown' level (1556 observations), and its Deposit (yes) ratio is also quite high compared with the other levels. Therefore, we will merge this unknown level into 'university.degree', which has the 3rd-highest Deposit (yes) ratio, right below 'unknown', and a very similar distribution of y.
# Drop 'illiterate' level
bank_final = bank_final %>% filter(education != "illiterate")
# Integrate 'unknown' level into 'university.degree' level
bank_final = bank_final %>%
dplyr::mutate(education = recode(education, "unknown" = "university.degree"))
bank_final$education <- factor(bank_final$education)
After preprocessing those 2 points, the distribution looks like below.
education_y_new <- sqldf(
"select education, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select education, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
education_y_new
            education total deposit_yes deposit_no yes_ratio no_ratio
1 university.degree 13388 1876 11512 14.01 85.99
2 professional.course 5118 591 4527 11.55 88.45
3 high.school 9249 1024 8225 11.07 88.93
4 basic.4y 4028 420 3608 10.43 89.57
5 basic.6y 2225 187 2038 8.40 91.60
6 basic.9y 5892 465 5427 7.89 92.11
education_vis<- ggplot(education_y_new, aes(education, yes_ratio)) +
geom_point(aes(color = education, size = total)) +
theme(axis.text.x = element_text(angle = 45, hjust=1)) +
labs(title = "Deposit(yes) ratio by Education",
x = "Education", y = "Deposit(yes) ratio(%)") + theme(text = element_text(face = "bold")) +
scale_colour_brewer()
ggplotly(education_vis)
Clients with a university degree have the highest deposit subscription ratio, followed by professional.course and high.school.
4) Default
# setting default parameters for crosstables
crosstable_function = function(df, var1, var2){
# df: dataframe containing both columns to cross
# var1, var2: columns to cross together.
CrossTable(df[, var1], df[, var2],
prop.r = T,
prop.c = F,
prop.t = F,
prop.chisq = F,
dnn = c(var1, var2))
}
crosstable_function(bank_final, "default", "y")
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 39900
| y
default | no | yes | Row Total |
-------------|-----------|-----------|-----------|
no | 27564 | 4134 | 31698 |
| 0.870 | 0.130 | 0.794 |
-------------|-----------|-----------|-----------|
unknown | 7770 | 429 | 8199 |
| 0.948 | 0.052 | 0.205 |
-------------|-----------|-----------|-----------|
yes | 3 | 0 | 3 |
| 1.000 | 0.000 | 0.000 |
-------------|-----------|-----------|-----------|
Column Total | 35337 | 4563 | 39900 |
-------------|-----------|-----------|-----------|
For the 'default' (has credit in default?) variable, only 3 clients have 'yes', so this level is practically impossible to use in our modelling even with oversampling. I therefore decided to drop this variable.
# Dropping the variable 'default'
bank_final = bank_final %>%
select(-default)
5) Housing
housing_y <- sqldf(
"select housing, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select housing, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
housing_y
  housing total deposit_yes deposit_no yes_ratio no_ratio
1 yes 20944 2468 18476 11.78 88.22
2 unknown 964 107 857 11.10 88.90
3 no 17992 1988 16004 11.05 88.95
We will simply drop the 'unknown' level of the housing variable, since it does not give much special information.
# Dropping 'unknown' level from the variable 'housing'
bank_final = bank_final %>% filter(housing != "unknown")
bank_final$housing <- factor(bank_final$housing)
housing_v <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, housing), fill = y)) + mosaic_theme +
xlab("housing loan") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2") # "Set4" is not a valid RColorBrewer palette
ggplotly(housing_v)
6) Loan
bank_final$loan <- factor(bank_final$loan)
loan_y <- sqldf(
"select loan, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select loan, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
loan_y
  loan total deposit_yes deposit_no yes_ratio no_ratio
1 no 32882 3781 29101 11.50 88.50
2 yes 6054 675 5379 11.15 88.85
loan_v <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, loan), fill = y)) + mosaic_theme +
xlab("Loan") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")
ggplotly(loan_v)
Insights from 'Loan' variable
The subscription ratios of clients without (11.50%) and with (11.15%) a personal loan are almost identical, so our first hypothesis, that clients with a loan have less chance of subscribing to a term deposit, is only very weakly supported by the data.
8) Month
month_y <- sqldf(
"select month, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select month, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
month_y
   month total deposit_yes deposit_no yes_ratio no_ratio
1 mar 521 268 253 51.44 48.56
2 dec 174 85 89 48.85 51.15
3 sep 545 244 301 44.77 55.23
4 oct 686 302 384 44.02 55.98
5 apr 2549 519 2030 20.36 79.64
6 jun 4851 534 4317 11.01 88.99
7 aug 5829 622 5207 10.67 89.33
8 nov 4006 405 3601 10.11 89.89
9 jul 6672 618 6054 9.26 90.74
10 may 13103 859 12244 6.56 93.44
bank_final$month <- factor(bank_final$month, levels=c("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"))
month_vis <- ggplot(bank_final, aes(month, fill=y)) + geom_bar(position="stack") + scale_fill_brewer(palette="Set2") + ylab("Deposit(yes/no)") + ggtitle("Distribution of Client and Deposit by Last contact Month")
ggplotly(month_vis)
month_y$month <- factor(month_y$month, levels = month_y$month[order(-month_y$yes_ratio)])
month_vis2 <- ggplot(month_y, aes(x=month, y=yes_ratio)) +
geom_bar(width=0.5, stat = "identity", fill="#fb8d61") +
labs(title="Deposit(yes) ratio by Last Contact Month") +
theme(axis.text.x = element_text(angle = 90))
ggplotly(month_vis2)
The months with relatively few contacts (March, December, September, October) show markedly higher Deposit (yes) ratios, while May, the month with by far the most contacts, has the lowest ratio (6.56%).
9) Day of Week
weekday_y <- sqldf(
"select day_of_week, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select day_of_week, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
weekday_y
  day_of_week total deposit_yes deposit_no yes_ratio no_ratio
1 thu 8106 997 7109 12.30 87.70
2 tue 7645 909 6736 11.89 88.11
3 wed 7714 917 6797 11.89 88.11
4 fri 7393 815 6578 11.02 88.98
5 mon 8078 818 7260 10.13 89.87
bank_final$day_of_week <- factor(bank_final$day_of_week, levels=c("mon", "tue", "wed", "thu", "fri"))
weekday_vis <- ggplot(bank_final, aes(day_of_week, fill=y)) + geom_bar(position="stack") + ylab("Deposit(yes/no)") + ggtitle("Distribution of Client and Deposit by Day of Week")
ggplotly(weekday_vis)
weekday_y$day_of_week <- factor(weekday_y$day_of_week, levels = weekday_y$day_of_week[order(-weekday_y$yes_ratio)])
weekday_vis2 <- ggplot(weekday_y, aes(x=day_of_week, y=yes_ratio)) +
geom_bar(width=0.5, stat = "identity", fill="#2ebfc4") +
labs(title="Deposit(yes) ratio by Day of Week") +
theme(axis.text.x = element_text(angle = 90))
ggplotly(weekday_vis2)
The result shows that certain weekdays, such as Thursday, Tuesday, and Wednesday, get a slightly more positive response on deposit subscription. Overall, however, the gap in Deposit (yes/no) between the weekdays is not prominent.
10) Previous
crosstable_function(bank_final, "previous", "y")
Cell Contents
|-------------------------|
| N |
| N / Row Total |
|-------------------------|
Total Observations in Table: 38936
| y
previous | no | yes | Row Total |
-------------|-----------|-----------|-----------|
0 | 30494 | 3021 | 33515 |
| 0.910 | 0.090 | 0.861 |
-------------|-----------|-----------|-----------|
1 | 3986 | 1435 | 5421 |
| 0.735 | 0.265 | 0.139 |
-------------|-----------|-----------|-----------|
Column Total | 34480 | 4456 | 38936 |
-------------|-----------|-----------|-----------|
previous_vis <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, previous), fill = y)) + mosaic_theme +
xlab("previous contact(yes/no)") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set3")
ggplotly(previous_vis)
The results show that clients who were contacted before this campaign are much more likely to subscribe to the deposit, so re-reaching clients can be regarded as quite important in this telemarketing campaign.
11) Poutcome
poutcome_y <- sqldf(
"select poutcome, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select poutcome, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
poutcome_y
     poutcome total deposit_yes deposit_no yes_ratio no_ratio
1 success 1320 863 457 65.38 34.62
2 failure 4101 572 3529 13.95 86.05
3 nonexistent 33515 3021 30494 9.01 90.99
poutcome_vis <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, poutcome), fill = y)) + mosaic_theme +
xlab("previous campaign outcome") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")
ggplotly(poutcome_vis)
Based on the results above, clients who were contacted at least once for a previous campaign have a higher chance of responding positively and subscribing to the deposit. Even when the previous outcome was a failure, the Deposit (yes) ratio is still higher than among clients who were never contacted.
Numerical Variables
For the numerical variables, we will not examine every one of them, but rather focus on the variables with important insights.
1) Age
age_y <- sqldf(
"select age, total, deposit_yes, deposit_no, round(100.*deposit_yes/total, 2) as yes_ratio, round(100.*deposit_no/total, 2) as no_ratio
from(
select age, count(*) as total, sum(case when y = 'yes' then 1 else 0 end) as deposit_yes, sum(case when y = 'no' then 1 else 0 end) as deposit_no
from bank_final
group by 1
) order by yes_ratio DESC")
age_y
   age total deposit_yes deposit_no yes_ratio no_ratio
1 87 1 1 0 100.00 0.00
2 89 2 2 0 100.00 0.00
3 98 2 2 0 100.00 0.00
4 92 4 3 1 75.00 25.00
5 86 7 5 2 71.43 28.57
6 77 19 12 7 63.16 36.84
7 82 16 10 6 62.50 37.50
8 80 30 17 13 56.67 43.33
9 79 13 7 6 53.85 46.15
10 76 32 17 15 53.13 46.88
11 78 25 13 12 52.00 48.00
12 66 49 25 24 51.02 48.98
[ reached 'max' / getOption("max.print") -- omitted 66 rows ]
boxplot(bank_final$age, main="Age plot", yaxt="n", xlab="age", horizontal=TRUE)
ggplot(bank_final, aes(x = age, fill = y)) + geom_histogram(binwidth=1) +
labs(y= "Count", x="Age", title = "Age")
Based on the plots above, we can think of 2 marketing campaign strategies: 1) target mid-young to middle-aged people (30s-40s); 2) design strategies to target older clients or very young clients in their 20s. Targeting young clients in their 20s looks especially promising, because the deposit distribution shows subscriptions starting to increase from around 25 years old, and retaining young clients benefits the bank, since they can turn into long-term loyal clients once rapport is built.
Either way, I will transform this 'age' variable (numeric) into a categorical variable by grouping the age range into 'low'/'mid'/'high'.
bank_final = bank_final %>%
mutate(age = if_else(age > 60, "high", if_else(age > 30, "mid", "low")))
bank_final$age <- factor(bank_final$age)
Histogram plots of all numeric variables
bank_final %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
Based on this histogram panel, let's look at the variables 'campaign' and 'pdays' in detail, since 'pdays' in particular shows an outlier-like distribution.
2) Campaign
campaign_vis <- bank_final %>%
ggplot() +
geom_mosaic(aes(x = product(y, campaign), fill = y)) + mosaic_theme +
xlab("number of contacts during this campaign") + ylab("Deposit Subscription") + scale_fill_brewer(palette="Set2")
ggplotly(campaign_vis)
3) Pdays
ggplot(data = bank_final, aes(x = pdays)) +
geom_histogram(bins = 30) +
labs(x = "Number of days since last contact", y = "Count")
The pdays distribution is hard to use for our analysis: most values are 999, a sentinel meaning the client was not previously contacted. We also already have a very similar variable, previous, which indicates whether the client was contacted before this campaign. So we can simply drop pdays.
bank_final = bank_final %>%
select(-pdays)
3-2. Correlation
Correlation plot of numeric variables.
bank_num <-select_if(bank_final, is.numeric) # only selecting numeric variables
bank_num %>%
cor() %>%
corrplot(method = "number",
type = "lower",
tl.cex = 0.8,
tl.srt = 45)
We can see high correlations between euribor3m and emp.var.rate, cons.price.idx and emp.var.rate, emp.var.rate and nr.employed, and euribor3m and nr.employed. Most of the social and economic variables are highly correlated with one another, which is expected. Here, I will drop emp.var.rate and euribor3m, since each is highly correlated with two or three other variables at once.
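Instead of reading the correlation plot by eye, this decision can also be automated with caret::findCorrelation, applied to the numeric columns before the drop; a sketch assuming the caret package is available and an illustrative 0.8 cutoff:

```r
library(caret)
# Flag columns involved in pairwise absolute correlations above the cutoff;
# names = TRUE returns column names rather than indices.
high_corr <- findCorrelation(cor(bank_num), cutoff = 0.8, names = TRUE)
high_corr  # candidate columns to drop
```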
bank_final = bank_final %>%
select(-emp.var.rate)
bank_final = bank_final %>%
select(-euribor3m)
Let's check the correlation again.
bank_num <-select_if(bank_final, is.numeric) # only selecting numeric variables
bank_num %>%
cor() %>%
corrplot(method = "number",
type = "lower",
tl.cex = 0.8,
tl.srt = 45)
Now the remaining numeric variables show no problematic correlations.
4. Data Preparation for Modelling (final pre-processing)
Categorical variable to 1/0 level, Dummy variable
Now we perform the final pre-processing to make the data ready for modelling. Many machine learning models require numeric inputs, so we transform all categorical variables into numerically expressed ones: variables with two levels (yes/no) become 1/0, and variables with more than two levels become dummy variables.
# month
bank_final$month <- ifelse(bank_final$month == 'jan', 1,
ifelse(bank_final$month == 'feb', 2,
ifelse(bank_final$month == 'mar', 3,
ifelse(bank_final$month == 'apr', 4,
ifelse(bank_final$month == 'may', 5,
ifelse(bank_final$month == 'jun', 6,
ifelse(bank_final$month == 'jul', 7,
ifelse(bank_final$month == 'aug', 8,
ifelse(bank_final$month == 'sep', 9,
ifelse(bank_final$month == 'oct', 10,
ifelse(bank_final$month == 'nov', 11, 12)))))))))))
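As an aside, the nested ifelse chain can be expressed in one line with match(), which returns each value's position in a lookup vector (a sketch, equivalent except that unmatched values become NA instead of falling through to 12; the assignment is commented out since month is already recoded above):

```r
# match('mar', month_levels) yields 3, and so on for each element.
month_levels <- c('jan','feb','mar','apr','may','jun',
                  'jul','aug','sep','oct','nov','dec')
# bank_final$month <- match(bank_final$month, month_levels)
```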
# day_of_week
bank_final$day_of_week <- ifelse(bank_final$day_of_week == 'mon', 1,
ifelse(bank_final$day_of_week == 'tue', 2,
ifelse(bank_final$day_of_week == 'wed', 3,
ifelse(bank_final$day_of_week == 'thu', 4, 5))))
# variables with 2 levels into 1/0
bank_final$housing <- ifelse(bank_final$housing == 'no', 0, 1)
bank_final$loan <- ifelse(bank_final$loan == 'no', 0, 1)
bank_final$contact <- ifelse(bank_final$contact == 'cellular', 0, 1) #contact has only 2 levels with cellular and telephone
bank_final$previous <- ifelse(bank_final$previous == '0', 0, 1)
bank_final$y <- ifelse(bank_final$y == 'no', 0, 1)
# variables with more than 2 levels into dummy variables
bank_dummies <- dummy_cols(.data=bank_final, select_columns = c("age", "job", "marital", "education", "poutcome"), remove_first_dummy = FALSE)
bank_dummies
age job marital education housing loan contact month day_of_week
1 mid housemaid married basic.4y 0 0 1 5 1
campaign previous poutcome cons.price.idx cons.conf.idx nr.employed y
1 1 0 nonexistent 93.994 -36.4 5191 0
age_high age_low age_mid job_admin. job_blue-collar job_entrepreneur
1 0 0 1 0 0 0
job_housemaid job_management job_retired job_self-employed job_services
1 1 0 0 0 0
job_student job_technician job_unemployed marital_divorced marital_married
1 0 0 0 0 1
marital_single education_basic.4y education_basic.6y education_basic.9y
1 0 1 0 0
education_high.school education_professional.course
1 0 0
education_university.degree poutcome_failure poutcome_nonexistent
1 0 0 1
poutcome_success
1 0
[ reached 'max' / getOption("max.print") -- omitted 38935 rows ]
bank_ready = bank_dummies[, -c(1,2,3,4,12)]
bank_ready <- bank_ready[c(11, 1:10, 12:37)]
str(bank_ready)
'data.frame': 38936 obs. of 37 variables:
$ y : num 0 0 0 0 0 0 0 0 0 0 ...
$ housing : num 0 0 1 0 0 0 0 0 1 1 ...
$ loan : num 0 0 0 0 1 0 0 0 0 0 ...
$ contact : num 1 1 1 1 1 1 1 1 1 1 ...
$ month : num 5 5 5 5 5 5 5 5 5 5 ...
$ day_of_week : num 1 1 1 1 1 1 1 1 1 1 ...
$ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
$ previous : num 0 0 0 0 0 0 0 0 0 0 ...
$ cons.price.idx : num 94 94 94 94 94 ...
$ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
$ nr.employed : num 5191 5191 5191 5191 5191 ...
$ age_high : int 0 0 0 0 0 0 0 0 0 0 ...
$ age_low : int 0 0 0 0 0 0 0 0 1 1 ...
$ age_mid : int 1 1 1 1 1 1 1 1 0 0 ...
$ job_admin. : int 0 0 0 1 0 0 1 0 0 0 ...
$ job_blue-collar : int 0 0 0 0 0 0 0 1 0 0 ...
$ job_entrepreneur : int 0 0 0 0 0 0 0 0 0 0 ...
$ job_housemaid : int 1 0 0 0 0 0 0 0 0 0 ...
$ job_management : int 0 0 0 0 0 0 0 0 0 0 ...
$ job_retired : int 0 0 0 0 0 0 0 0 0 0 ...
$ job_self-employed : int 0 0 0 0 0 0 0 0 0 0 ...
$ job_services : int 0 1 1 0 1 1 0 0 0 1 ...
$ job_student : int 0 0 0 0 0 0 0 0 0 0 ...
$ job_technician : int 0 0 0 0 0 0 0 0 1 0 ...
$ job_unemployed : int 0 0 0 0 0 0 0 0 0 0 ...
$ marital_divorced : int 0 0 0 0 0 0 0 0 0 0 ...
$ marital_married : int 1 1 1 1 1 1 1 1 0 0 ...
$ marital_single : int 0 0 0 0 0 0 0 0 1 1 ...
$ education_basic.4y : int 1 0 0 0 0 0 0 0 0 0 ...
$ education_basic.6y : int 0 0 0 1 0 0 0 0 0 0 ...
$ education_basic.9y : int 0 0 0 0 0 1 0 0 0 0 ...
$ education_high.school : int 0 1 1 0 1 0 0 0 0 1 ...
$ education_professional.course: int 0 0 0 0 0 0 1 0 1 0 ...
$ education_university.degree : int 0 0 0 0 0 0 0 1 0 0 ...
$ poutcome_failure : int 0 0 0 0 0 0 0 0 0 0 ...
$ poutcome_nonexistent : int 1 1 1 1 1 1 1 1 1 1 ...
$ poutcome_success : int 0 0 0 0 0 0 0 0 0 0 ...
Finally, our dataset is ready with numeric variables.
write.csv(bank_ready, "/Users/jiyoungkim/Desktop/SGH 2 sem/SLM class/Project/bank_final.csv")
Data Splitting (Train:Test=70:30)
require(caTools)
set.seed(1)
# Train and Test
sample = sample.split(bank_ready$y, SplitRatio = 0.7) # split on the label vector, not the whole data frame
bank_train = subset(bank_ready, sample==TRUE)
bank_test = subset(bank_ready, sample==FALSE)
SMOTE (method to deal with imbalanced data)
We will oversample the minority class (y = 1) using the SMOTE method (here, the implementation from the DMwR package).
prop.table(table(bank_train$y)) # distribution of deposit y 0/1 in our train set
0 1
0.8858185 0.1141815
bank_train_smote <- bank_train
bank_train_smote$y = as.factor(bank_train_smote$y)
bank_train_smote <- SMOTE(form = y~., data = bank_train_smote, perc.over = 100, perc.under =200)
prop.table(table(bank_train_smote$y)) # check distribution of deposit y 0/1 in our train set after SMOTE (now, it's 50:50)
0 1
0.5 0.5
bank_test$y = as.factor(bank_test$y)
5. Modelling
# Modelling Pipeline we need for visualization of Model Evaluation
# plotting importance from predictive models into two panels
fun_imp_ggplot_split = function(model){
# model: model used to plot variable importances
if (class(model)[1] == "ranger"){
imp_df = model$variable.importance %>%
data.frame("Overall" = .) %>%
rownames_to_column() %>%
rename(variable = rowname) %>%
arrange(-Overall)
} else {
imp_df = varImp(model) %>%
rownames_to_column() %>%
rename(variable = rowname) %>%
arrange(-Overall)
}
# first panel (half most important variables)
gg1 = imp_df %>%
slice(1:floor(nrow(.)/2)) %>%
ggplot() +
aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
geom_bar() +
coord_flip() +
xlab("Variables") +
ylab("Importance") +
theme(legend.position = "none")
imp_range = ggplot_build(gg1)[["layout"]][["panel_params"]][[1]][["x.range"]]
imp_gradient = scale_fill_gradient(limits = c(-imp_range[2], -imp_range[1]),
low = "#132B43",
high = "#56B1F7")
# second panel (less important variables)
gg2 = imp_df %>%
slice((floor(nrow(.)/2)+1):nrow(.)) %>% # parentheses needed: `:` binds tighter than `+`
ggplot() +
aes(x = reorder(variable, Overall), weight = Overall, fill = -Overall) +
geom_bar() +
coord_flip() +
xlab("") +
ylab("Importance") +
theme(legend.position = "none") +
ylim(imp_range) +
imp_gradient
# arranging together
gg_both = plot_grid(gg1 + imp_gradient,
gg2)
return(gg_both)
}
# plotting two performance measures
fun_gg_cutoff = function(score, obs, measure1, measure2) {
# score: predicted scores
# obs: real classes
# measure1, measure2: which performance metrics to plot
predictions = prediction(score, obs)
performance1 = performance(predictions, measure1)
performance2 = performance(predictions, measure2)
df1 = data.frame(x = performance1@x.values[[1]],
y = performance1@y.values[[1]],
measure = measure1,
stringsAsFactors = F) %>%
drop_na()
df2 = data.frame(x = performance2@x.values[[1]],
y = performance2@y.values[[1]],
measure = measure2,
stringsAsFactors = F) %>%
drop_na()
# df contains all the data needed to plot both curves
df = df1 %>%
bind_rows(df2)
# extracting best cut for each measure
y_max_measure1 = max(df1$y, na.rm = T)
x_max_measure1 = df1[df1$y == y_max_measure1, "x"][1]
y_max_measure2 = max(df2$y, na.rm = T)
x_max_measure2 = df2[df2$y == y_max_measure2, "x"][1]
txt_measure1 = paste("Best cut for", measure1, ": x =", round(x_max_measure1, 3))
txt_measure2 = paste("Best cut for", measure2, ": x =", round(x_max_measure2, 3))
txt_tot = paste(txt_measure1, "\n", txt_measure2, sep = "")
# plotting both measures in the same plot, with some detail around.
gg = df %>%
ggplot() +
aes(x = x,
y = y,
colour = measure) +
geom_line() +
geom_vline(xintercept = c(x_max_measure1, x_max_measure2), linetype = "dashed", color = "gray") +
geom_hline(yintercept = c(y_max_measure1, y_max_measure2), linetype = "dashed", color = "gray") +
labs(caption = txt_tot) +
theme(plot.caption = element_text(hjust = 0)) +
xlim(c(0, 1)) +
ylab("") +
xlab("Threshold")
return(gg)
}
# creating classes according to score and cut
fun_cut_predict = function(score, cut) {
# score: predicted scores
# cut: threshold for classification
classes = score
classes[classes > cut] = 1
classes[classes <= cut] = 0
classes = as.factor(classes)
return(classes)
}
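A tiny usage example of the helper above, assuming fun_cut_predict as defined (illustrative scores):

```r
# Scores above the cut become class 1, the rest class 0,
# returned as a factor with levels "0" and "1".
fun_cut_predict(c(0.10, 0.45, 0.80), cut = 0.5)  # -> factor: 0 0 1
```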
# computing AUPR
aucpr = function(obs, score){
# obs: real classes
# score: predicted scores
df = data.frame("pred" = score,
"obs" = obs)
prc = pr.curve(df[df$obs == 1, ]$pred,
df[df$obs == 0, ]$pred)
return(prc$auc.davis.goadrich)
}
# plotting PR curve
gg_prcurve = function(df) {
# df: df containing models scores by columns and the last column must be
# nammed "obs" and must contain real classes.
# init
df_gg = data.frame("v1" = numeric(),
"v2" = numeric(),
"v3" = numeric(),
"model" = character(),
stringsAsFactors = F)
# individual pr curves
for (i in c(1:(ncol(df)-1))) {
x1 = df[df$obs == 1, i]
x2 = df[df$obs == 0, i]
prc = pr.curve(x1, x2, curve = T)
df_prc = as.data.frame(prc$curve, stringsAsFactors = F) %>%
mutate(model = colnames(df)[i])
# combining pr curves
df_gg = bind_rows(df_gg,
df_prc)
}
gg = df_gg %>%
ggplot() +
aes(x = V1, y = V2, colour = model) +
geom_line() +
xlab("Recall") +
ylab("Precision")
return(gg)
}
5-1) Logistic Regression
Let's start building logistic regression model first.
logistic <- glm(y~., data=bank_train_smote, family="binomial")
summary(logistic)
Call:
glm(formula = y ~ ., family = "binomial", data = bank_train_smote)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8234 -0.8749 -0.1629 0.9241 2.0497
Coefficients: (6 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 35.3904625 4.3147153 8.202 2.36e-16 ***
housing -0.0584067 0.0432331 -1.351 0.176704
loan -0.1612413 0.0607822 -2.653 0.007983 **
contact -0.7896574 0.0661515 -11.937 < 2e-16 ***
month 0.0002514 0.0127347 0.020 0.984251
day_of_week 0.0263697 0.0155738 1.693 0.090415 .
campaign -0.0444649 0.0138909 -3.201 0.001370 **
previous 1.2204701 0.1292557 9.442 < 2e-16 ***
cons.price.idx 0.2475383 0.0532245 4.651 3.31e-06 ***
cons.conf.idx 0.0125578 0.0052807 2.378 0.017403 *
nr.employed -0.0111821 0.0004108 -27.218 < 2e-16 ***
age_high 0.5511016 0.1645966 3.348 0.000813 ***
age_low 0.1358729 0.0599678 2.266 0.023466 *
age_mid NA NA NA NA
job_admin. -0.2050983 0.1397604 -1.467 0.142240
[ reached getOption("max.print") -- omitted 22 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16658 on 12015 degrees of freedom
Residual deviance: 13225 on 11985 degrees of freedom
AIC: 13287
Number of Fisher Scoring iterations: 5
Some variables have high p-values (>0.05), meaning they are not significant in our model. Therefore, we will build a model with only the significant variables in the next step.
Choose appropriate Cut/Threshold
Before getting into model evaluation, let's consider which metrics are appropriate for this telemarketing business problem. The top priority is precisely identifying clients who are willing to subscribe to a deposit product (true positives), while also not missing potential subscribers who are misclassified as non-subscribers (false negatives). Given this, the F-score is an ideal metric here: it is designed to measure classification quality, and a high F-score requires both precision and recall to be high. Therefore, I will focus on the F1 score, which balances precision and recall, rather than on accuracy alone.
logistic_train_score = predict(logistic, newdata = bank_train_smote, type = "response")
logistic_test_score = predict(logistic, newdata = bank_test, type = "response")
measure_train = fun_gg_cutoff(logistic_train_score, bank_train_smote$y,
"acc", "f")
measure_train +
geom_vline(xintercept = c(0.4, 0.5),
linetype = "dashed") We can see the best cut(threshold) for both F score and Accuracy in training set prediction is 0.4.
Then, let's check real test set threshold.
logistic_train_cut = 0.4
measure_test = fun_gg_cutoff(logistic_test_score, bank_test$y, "acc", "f")
measure_test + geom_vline(xintercept = c(0.5, logistic_train_cut), linetype = "dashed")
Considering the trade-off between accuracy and F-score, 0.68 is the better threshold for the test set: it gives the highest F1 score while keeping accuracy high.
Confusion Matrix & Model Evaluation Metrics
logistic_test_class = fun_cut_predict(logistic_test_score, 0.68)
# matrix
logistic_test_confm = confusionMatrix(logistic_test_class, bank_test$y,
positive = "1",
mode = "everything")
logistic_test_confm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 10364 755
1 811 697
Accuracy : 0.876
95% CI : (0.8701, 0.8817)
No Information Rate : 0.885
P-Value [Acc > NIR] : 0.9992
Kappa : 0.4007
Mcnemar's Test P-Value : 0.1646
Sensitivity : 0.4800
Specificity : 0.9274
Pos Pred Value : 0.4622
Neg Pred Value : 0.9321
Precision : 0.4622
Recall : 0.4800
F1 : 0.4709
Prevalence : 0.1150
Detection Rate : 0.0552
Detection Prevalence : 0.1194
Balanced Accuracy : 0.7037
'Positive' Class : 1
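As a sanity check, the reported precision, recall, and F1 follow directly from the confusion-matrix counts above (TP = 697, FP = 811, FN = 755):

```r
# F1 is the harmonic mean of precision and recall.
f1_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)  # of the clients we target, how many subscribe
  recall    <- tp / (tp + fn)  # of the subscribers, how many we catch
  2 * precision * recall / (precision + recall)
}
f1_score(tp = 697, fp = 811, fn = 755)  # ~0.471, matching the F1 above
```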
5-2) Logistic Regression with selected variables (Stepwise selection)
logistic2 <- step(logistic, direction = "both")
summary(logistic2)
Call:
glm(formula = y ~ loan + contact + day_of_week + campaign + previous +
cons.price.idx + cons.conf.idx + nr.employed + age_high +
age_low + job_admin. + job_entrepreneur + job_retired + job_services +
job_student + marital_divorced + education_basic.4y + education_basic.6y +
education_basic.9y + education_high.school + education_professional.course +
poutcome_failure, family = "binomial", data = bank_train_smote)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8062 -0.8767 -0.1690 0.9229 2.0473
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 35.0189570 4.2976897 8.148 3.69e-16 ***
loan -0.1675055 0.0606105 -2.764 0.00572 **
contact -0.7883827 0.0616705 -12.784 < 2e-16 ***
day_of_week 0.0268959 0.0155646 1.728 0.08399 .
campaign -0.0437198 0.0138670 -3.153 0.00162 **
previous 1.2216896 0.1289179 9.476 < 2e-16 ***
cons.price.idx 0.2502793 0.0531471 4.709 2.49e-06 ***
cons.conf.idx 0.0125415 0.0045088 2.782 0.00541 **
nr.employed -0.0111892 0.0003919 -28.553 < 2e-16 ***
age_high 0.5355574 0.1644172 3.257 0.00112 **
age_low 0.1513121 0.0568551 2.661 0.00778 **
job_admin. -0.1073273 0.0555672 -1.931 0.05342 .
job_entrepreneur -0.1765985 0.1198463 -1.474 0.14061
job_retired 0.3571561 0.1240757 2.879 0.00400 **
job_services -0.1466066 0.0858669 -1.707 0.08775 .
[ reached getOption("max.print") -- omitted 8 rows ]
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16658 on 12015 degrees of freedom
Residual deviance: 13230 on 11993 degrees of freedom
AIC: 13276
Number of Fisher Scoring iterations: 5
As shown above, after stepwise variable selection only significant variables remain in our LR model (except education_basic.4y).
logistic2_train_score = predict(logistic2, newdata = bank_train_smote, type = "response")
logistic2_test_score = predict(logistic2, newdata = bank_test, type = "response")
measure_train2 = fun_gg_cutoff(logistic2_train_score, bank_train_smote$y, "acc", "f")
measure_train2 + geom_vline(xintercept = c(0.4, 0.5), linetype = "dashed")
The best cut (threshold) for both F-score and accuracy on the training set is again 0.4.
Then, let's check real test set threshold.
logistic2_train_cut = 0.4
measure_test2 = fun_gg_cutoff(logistic2_test_score, bank_test$y, "acc", "f")
measure_test2 + geom_vline(xintercept = c(0.5, logistic2_train_cut), linetype = "dashed")
For the test set, the ideal threshold is 0.67, as it gives the highest F-score with moderate accuracy.
Confusion Matrix & Model Evaluation Metrics
logistic2_test_class = fun_cut_predict(logistic2_test_score, 0.67)
# matrix
logistic2_test_confm = confusionMatrix(logistic2_test_class, bank_test$y,
positive = "1",
mode = "everything")
logistic2_test_confm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 10305 733
1 870 719
Accuracy : 0.873
95% CI : (0.8671, 0.8788)
No Information Rate : 0.885
P-Value [Acc > NIR] : 0.9999847
Kappa : 0.4009
Mcnemar's Test P-Value : 0.0006817
Sensitivity : 0.49518
Specificity : 0.92215
Pos Pred Value : 0.45249
Neg Pred Value : 0.93359
Precision : 0.45249
Recall : 0.49518
F1 : 0.47287
Prevalence : 0.11499
Detection Rate : 0.05694
Detection Prevalence : 0.12584
Balanced Accuracy : 0.70866
'Positive' Class : 1
Compared with the first full LR model, most metrics changed very little; sensitivity, recall, and F1 score improved slightly.
5-3) Decision Tree
There are four main tuning parameters for a decision tree in rpart: cp (complexity parameter), minsplit (minimum number of observations needed to attempt a split), xval (number of cross-validations), and maxdepth (maximum depth of any node). The smaller cp is, the more complex the tree becomes, so finding the optimal cp is important. Accordingly, we will tune cp.
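These parameters are passed to rpart via rpart.control; a sketch of the full-tree call below with the parameters made explicit (the values other than cp are rpart's defaults, shown for illustration):

```r
library(rpart)
# Grow a deliberately over-complex tree (near-zero cp), then prune it back
# later using the cross-validated optimal cp from the model's $cptable.
ctrl <- rpart.control(cp = 0.1^20,    # no complexity penalty while growing
                      minsplit = 20,  # min observations to attempt a split (default)
                      xval = 10,      # cross-validation folds (default)
                      maxdepth = 30)  # max depth of any node (default)
# dtree1 <- rpart(y ~ ., data = bank_train_smote, control = ctrl)
```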
set.seed(1004)
dtree1 <- rpart(y~., data = bank_train_smote, cp = 0.1^20) # full tree model
rpart.plot(dtree1)
# Find the optimal CP
dtree1$cptable
CP nsplit rel error xerror xstd
1 4.184421e-01 0 1.0000000 1.0234687 0.009120117
2 3.728362e-02 1 0.5815579 0.5815579 0.008285569
3 3.012650e-02 2 0.5442743 0.5545939 0.008167752
4 2.521638e-02 4 0.4840213 0.4841877 0.007815372
5 5.825566e-03 6 0.4335885 0.4339214 0.007520252
6 1.830892e-03 12 0.3974700 0.3981358 0.007285318
7 1.747670e-03 13 0.3956391 0.3974700 0.007280737
8 1.553484e-03 15 0.3921438 0.3963049 0.007272700
9 1.539614e-03 20 0.3839880 0.3916445 0.007240308
10 1.081891e-03 26 0.3740013 0.3839880 0.007186231
11 9.986684e-04 31 0.3673435 0.3814913 0.007168361
12 9.154461e-04 34 0.3643475 0.3771638 0.007137110
13 8.877053e-04 36 0.3625166 0.3761651 0.007129847
14 8.322237e-04 39 0.3598535 0.3759987 0.007128635
15 7.490013e-04 40 0.3590213 0.3761651 0.007129847
[ reached getOption("max.print") -- omitted 26 rows ]
plotcp(dtree1)
cp_optimal <- dtree1$cptable[which(dtree1$cptable[,"xerror"]==min(dtree1$cptable[,"xerror"])),"CP"]
cp_optimal
          16           24
0.0005825566 0.0003120839
Two values tie for the optimal cp (0.0005825566 and 0.0003120839), so I will use the larger one, 0.0005825566, as our optimal cp.
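When several cp values tie on the minimum cross-validated error, choosing the larger cp (the simpler tree) can also be done programmatically on the vector extracted above (a one-line sketch):

```r
# Among tied candidates, keep the largest cp, i.e. the most parsimonious tree.
cp_optimal <- max(cp_optimal)
```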
dtree_optimal <- prune(dtree1, cp = 0.0005825566) # pruned tree with optimal cp
rpart.plot(dtree_optimal)
Even after pruning, the tree still uses many variables that are not actually significant. Therefore, we will build one more decision tree model with variables filtered according to the importance plot below.
fun_imp_ggplot_split(dtree_optimal)
tree_train_score = predict(dtree_optimal, newdata = bank_train_smote, type = "prob")[, 2]
tree_test_score = predict(dtree_optimal, newdata = bank_test, type = "prob")[, 2]
Choose appropriate Cut/Threshold
Trainset
measure_train = fun_gg_cutoff(tree_train_score, bank_train_smote$y,"acc", "f")
measure_train + geom_vline(xintercept = c(0.4, 0.5), linetype = "dashed")
The best cut for F-score is around 0.4 for the training-set predictions.
Testset
tree_train_cut = 0.4
measure_test = fun_gg_cutoff(tree_test_score, bank_test$y, "acc", "f")
measure_test + geom_vline(xintercept = c(tree_train_cut, 0.5), linetype = "dashed")
For the test-set predictions, the best threshold for F-score is 0.6. Let's check the confusion matrix and metrics of this tree model.
Confusion Matrix & Model Evaluation Metrics
tree_test_class = fun_cut_predict(tree_test_score, 0.6)
tree_test_confm = confusionMatrix(tree_test_class, bank_test$y, positive = "1", mode = "everything")
tree_test_confm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 10069 587
1 1106 865
Accuracy : 0.8659
95% CI : (0.8599, 0.8718)
No Information Rate : 0.885
P-Value [Acc > NIR] : 1
Kappa : 0.4299
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.5957
Specificity : 0.9010
Pos Pred Value : 0.4389
Neg Pred Value : 0.9449
Precision : 0.4389
Recall : 0.5957
F1 : 0.5054
Prevalence : 0.1150
Detection Rate : 0.0685
Detection Prevalence : 0.1561
Balanced Accuracy : 0.7484
'Positive' Class : 1
Compared with the stepwise LR model above, this tree model shows better F1 score, recall, and sensitivity. We will compare its performance against all the other models in the "Model Evaluation" step.
5-4) Decision Tree with selected variables (based on variable importance)
This time, I'll only include top 10 variables in terms of variable importance that we saw above.
set.seed(1004)
dtree2 <- rpart(y~nr.employed+cons.conf.idx+cons.price.idx+contact+month+poutcome_success+campaign+day_of_week+poutcome_failure+education_university.degree, data = bank_train_smote, cp = 0.1^20) # tree model with filtered variables based on their importance
rpart.plot(dtree2)
# Find the optimal CP
dtree2$cptable
CP nsplit rel error xerror xstd
1 4.184421e-01 0 1.0000000 1.0234687 0.009120117
2 3.728362e-02 1 0.5815579 0.5815579 0.008285569
3 3.012650e-02 2 0.5442743 0.5545939 0.008167752
4 2.521638e-02 4 0.4840213 0.4841877 0.007815372
5 5.825566e-03 6 0.4335885 0.4339214 0.007520252
6 1.747670e-03 12 0.3974700 0.3981358 0.007285318
7 1.553484e-03 14 0.3939747 0.3959720 0.007270399
8 1.539614e-03 19 0.3858189 0.3911451 0.007236815
9 1.331558e-03 25 0.3758322 0.3858189 0.007199261
10 1.081891e-03 27 0.3731691 0.3791611 0.007151578
11 9.986684e-04 32 0.3665113 0.3748336 0.007120134
12 7.490013e-04 40 0.3576897 0.3681758 0.007071057
13 6.102974e-04 44 0.3546937 0.3651798 0.007048690
14 5.825566e-04 50 0.3510320 0.3615180 0.007021111
15 4.993342e-04 54 0.3487017 0.3601864 0.007011016
[ reached getOption("max.print") -- omitted 14 rows ]
plotcp(dtree2)
cp_optimal <- dtree2$cptable[which(dtree2$cptable[,"xerror"]==min(dtree2$cptable[,"xerror"])),"CP"] # index dtree2's own cptable, not dtree1's
cp_optimal
[1] 0.0004993342
dtree2_optimal <- prune(dtree2, cp_optimal) # pruned tree with optimal cp
rpart.plot(dtree2_optimal)
tree2_train_score = predict(dtree2_optimal, newdata = bank_train_smote, type = "prob")[, 2]
tree2_test_score = predict(dtree2_optimal, newdata = bank_test, type = "prob")[, 2]
Choose appropriate Cut/Threshold
Trainset
measure_train2 = fun_gg_cutoff(tree2_train_score, bank_train_smote$y,"acc", "f")
measure_train2 + geom_vline(xintercept = c(0.4, 0.5), linetype = "dashed")
The best cut for F-score is again around 0.4 for the training-set predictions.
Testset
tree2_train_cut = 0.4
measure_test2 = fun_gg_cutoff(tree2_test_score, bank_test$y, "acc", "f")
measure_test2 + geom_vline(xintercept = c(tree2_train_cut, 0.5), linetype = "dashed")
For the test-set predictions, the best threshold for F-score is 0.7. Let's check the confusion matrix and metrics of this tree model.
Confusion Matrix & Model Evaluation Metrics
tree2_test_class = fun_cut_predict(tree2_test_score, 0.7)
tree2_test_confm = confusionMatrix(tree2_test_class, bank_test$y, positive = "1", mode = "everything")
tree2_test_confm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 10314 666
1 861 786
Accuracy : 0.8791
95% CI : (0.8733, 0.8847)
No Information Rate : 0.885
P-Value [Acc > NIR] : 0.9819
Kappa : 0.4386
Mcnemar's Test P-Value : 6.885e-07
Sensitivity : 0.54132
Specificity : 0.92295
Pos Pred Value : 0.47723
Neg Pred Value : 0.93934
Precision : 0.47723
Recall : 0.54132
F1 : 0.50726
Prevalence : 0.11499
Detection Rate : 0.06225
Detection Prevalence : 0.13043
Balanced Accuracy : 0.73214
'Positive' Class : 1
Compared with the previous full-variable DT1 model, this one shows better accuracy, specificity, precision, and F1 score, though lower sensitivity (recall). Overall, the gap between the two models is small.
5-5) K-nearest Neighbor
Last model is KNN, a non-parametric classification method.
knn <- train(y~., data=bank_train_smote, method ="knn", trControl = trainControl("cv", number = 5), preProcess = c("range"), tuneLength=10)
plot(knn)
pred.knn <- predict(knn, newdata=bank_test)
### Confusion matrix
knn_test_confm = confusionMatrix(pred.knn, bank_test$y, positive = "1", mode = "everything")
knn_test_confm
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 7807 562
1 3368 890
Accuracy : 0.6888
95% CI : (0.6806, 0.6968)
No Information Rate : 0.885
P-Value [Acc > NIR] : 1
Kappa : 0.1693
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.61295
Specificity : 0.69861
Pos Pred Value : 0.20902
Neg Pred Value : 0.93285
Precision : 0.20902
Recall : 0.61295
F1 : 0.31173
Prevalence : 0.11499
Detection Rate : 0.07048
Detection Prevalence : 0.33721
Balanced Accuracy : 0.65578
'Positive' Class : 1
Based on these results, let's compare all the models together in the next step.
6. Model Evaluation
ROC Curve & Precision-Recall AUC Curve
score_test = data.frame("Logistic Regression Full" = logistic_test_score,
"Logistic Regression Stepwise" = logistic2_test_score,
"Decision Tree Full" = tree_test_score,
"Decision Tree Var selected(imp)" = tree2_test_score,
"obs" = as.numeric(bank_test$y) - 1)
roc_test = score_test %>%
gather(key = "Method", value = "score", -obs) %>%
ggplot() +
aes(d = obs,
m = score,
color = Method) +
geom_roc(labels = F, pointsize = 0, size = 0.6) +
xlab("Specificity") +
ylab("Sensitivity") +
ggtitle("ROC Curve", subtitle = "Validation dataset")
prcurve_test = gg_prcurve(score_test) + ggtitle("PR Curve", subtitle = "Validation dataset")
curves_test = ggarrange(roc_test, prcurve_test,
common.legend = T,
legend = "bottom")print(curves_test) I wanted to show the knn model at the same time on the plot, but not available to drive ROC, AUC attribute from knn model due to the internal error.
df_final = data.frame("Model" = c("Logistic Regression Full",
"Logistic Regression Stepwise",
"Decision Tree Full",
"Decision Tree Var selected(imp)",
"KNN"),
"AUROC" = c(auc(bank_test$y, logistic_test_score),
auc(bank_test$y, logistic2_test_score),
auc(bank_test$y, tree_test_score),
auc(bank_test$y, tree2_test_score),
NA),
"AUPR" = c(aucpr(bank_test$y, logistic_test_score),
aucpr(bank_test$y, logistic2_test_score),
aucpr(bank_test$y, tree_test_score),
aucpr(bank_test$y, tree2_test_score), NA),
"Cut" = c(0.68, 0.67, 0.6, 0.7,NA),
"Accuracy" = c(logistic_test_confm[["overall"]][["Accuracy"]],
logistic2_test_confm[["overall"]][["Accuracy"]],
tree_test_confm[["overall"]][["Accuracy"]],
tree2_test_confm[["overall"]][["Accuracy"]], 0.6896),
"F1" = c(logistic_test_confm[["byClass"]][["F1"]],
logistic2_test_confm[["byClass"]][["F1"]],
tree_test_confm[["byClass"]][["F1"]],
tree2_test_confm[["byClass"]][["F1"]], 0.31180),
stringsAsFactors = F)
df_final %>%
arrange(-F1)
Model AUROC AUPR Cut Accuracy F1
1 Decision Tree Var selected(imp) 0.7916969 0.4324922 0.70 0.8790687 0.5072604
2 Decision Tree Full 0.7913294 0.4320732 0.60 0.8659222 0.5054046
3 Logistic Regression Stepwise 0.7762411 0.4381652 0.67 0.8730498 0.4728708
4 Logistic Regression Full 0.7759985 0.4369693 0.68 0.8759800 0.4709459
5 KNN NA NA NA 0.6896000 0.3118000
(*AUROC = area under the ROC curve, AUPR = area under the precision-recall curve)
Based on the above results, the decision tree model with variables selected by importance has the highest performance in terms of F1 score and accuracy. As discussed at the start of the modelling section, our preferred evaluation metrics were precision, recall, and F1 score rather than accuracy itself, because telemarketing needs to target the right clients as precisely as possible while not losing potential subscribers.
Therefore, we can say the best model is Decision Tree model with selected variables.
7. Conclusion / Summary
Throughout this bank telemarketing data analysis project, I came to appreciate the importance of targeting and segmentation in marketing campaigns. The EDA part in particular yielded many insights into which factors make bank telemarketing successful.
There were several challenges I encountered during the project as follows:
1) There was a large amount of "unknown" values, so I had to drop some of them. This was not a big sacrifice, since I checked the distribution of the target y for each variable before dropping, but a dataset with fewer unknown values would allow a better analysis.;
2) Since our target variable y was heavily imbalanced (0:1 = 89%:11%), the original data was not ideal for analysis. To deal with this, I applied SMOTE (Synthetic Minority Oversampling Technique). This method carries a risk of overfitting due to oversampling the minority class; fortunately, I did not identify any severe overfitting in the modelling part, but SMOTE is still not a perfect remedy for imbalanced data.;
3) Looking back, I could not derive good insights from the social and economic variables (which were mostly numeric) in our analysis. It would have been better to examine those variables in detail as well.
Social & Economic Variables:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)